
Conversation

bisgaard-itis
Contributor

@bisgaard-itis bisgaard-itis commented Jun 30, 2025

What do these changes do?

  • Joel (from the desktop team) experienced that the api-server would sometimes hang when uploading files concurrently (see issue below).
  • It was already established that this was due to the cancel_on_disconnect decorator.
  • Together with @giancarloromeo we discovered that the issue seems to be that when task.cancel() is called (on an asyncio.Task) there is no guarantee that the task will be cancelled the next time the event loop picks it up. This is an issue in our construction because, apparently, FastAPI/Starlette doesn't return the response to the user until the poller task is done. The fix introduced here is to use an asyncio.TaskGroup. After syncing with Gemini and @giancarloromeo this seems to be the modern, idiomatic way of getting "structured concurrency" in Python (a minimal sketch of the idea is shown below). I have tested this by uploading 500 files concurrently to my local deployment.
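
Roughly, the shape of the fix is the following (a minimal sketch of the idea, not the exact code in this PR; it assumes Python 3.11+ for asyncio.TaskGroup and that the wrapped endpoint receives the Request as a keyword argument):

```python
import asyncio
from functools import wraps

from fastapi import Request


def cancel_on_disconnect(handler):
    # Minimal sketch only: run the handler and a disconnect poller inside an
    # asyncio.TaskGroup so that neither task can outlive the request.
    @wraps(handler)
    async def wrapper(*args, request: Request, **kwargs):
        async with asyncio.TaskGroup() as tg:
            handler_task = tg.create_task(handler(*args, request=request, **kwargs))

            async def poll_disconnect() -> None:
                while not await request.is_disconnected():
                    await asyncio.sleep(1)
                # Client went away: stop the handler
                handler_task.cancel()

            poller_task = tg.create_task(poll_disconnect())
            try:
                return await handler_task
            finally:
                # Handler is done (or was cancelled): stop polling. Leaving the
                # `async with` block waits until *both* tasks have actually
                # finished, so no orphaned poller can keep using `request`.
                poller_task.cancel()

    return wrapper
```

The structural point is in the last comment: the wrapper cannot return while a poller task is still alive and holding a reference to the request, which is exactly the situation that caused the hang.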

Related issue/s

How to test

Dev-ops

Food for thought

  • What if the poller task finishes first? In that case the handler task will be cancelled. But apparently there is no guarantee that the task will be cancelled the next time the event loop picks it up, so the handler task can live on for a while (see the small illustration below). So the question is: is it even worth having the added complexity of the cancel_on_disconnect decorator?
  • Note also: a more modern way to do this would be using an asyncio.TaskGroup
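
A small, self-contained illustration of that point (standard library only): task.cancel() merely requests cancellation, and the CancelledError is only delivered at the task's next suspension point, so the task is not necessarily gone by the time cancel() returns.

```python
import asyncio

async def worker():
    try:
        await asyncio.sleep(60)
    except asyncio.CancelledError:
        print("worker: cancellation finally delivered")
        raise

async def main():
    task = asyncio.create_task(worker())
    await asyncio.sleep(0)                   # let the worker start
    task.cancel()                            # only *requests* cancellation
    print(f"right after cancel(): done={task.done()}")          # -> done=False
    try:
        await task                           # the loop now delivers the cancellation
    except asyncio.CancelledError:
        print(f"after awaiting: cancelled={task.cancelled()}")  # -> cancelled=True

asyncio.run(main())
```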

After discussing this issue with Gemini, here's the main explanation for the deadlock encountered here:

  Here is the step-by-step flow under this corrected understanding:

   1. Handler Finishes: Your handler_task completes.
   2. Cleanup Begins: The wrapper starts cleaning up the poller_task.
   3. Timeout Occurs: await asyncio.wait_for(poller_task, timeout=3) raises asyncio.TimeoutError because the poller is still stuck in await request.is_disconnected().
   4. Wrapper Finishes: The except block catches the timeout, the finally block runs, and the wrapper function successfully executes return await handler_task. From the perspective of your code, the request handler has returned a value and is done.
   5. The Orphaned Task: But the poller_task was not actually killed. The wrapper simply stopped waiting for it. The poller_task is now an "orphan": still alive and running on the event loop, completely detached from its parent.
   6. Server Cleanup Deadlock: The ASGI server (Starlette/Uvicorn) has received the response object from your wrapper. It sends the response back to the client. Now it needs to perform its own internal cleanup for the request. This cleanup involves releasing resources associated with the request object.

  This is where the new deadlock happens.

   * The server's cleanup logic is waiting to fully release the request object and its resources.
   * The orphaned poller_task is still alive and holds a reference to that same request object, periodically calling await request.is_disconnected() on it.

  The server cannot fully clean up and close the connection context while your orphaned task is still actively using the request object. The server is essentially waiting for the poller_task to finish, but the poller_task will only finish when the client disconnects, which may never happen, as the client has already received the response and is likely waiting for the connection to close cleanly.
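
To make the steps above concrete, the problematic pattern looks roughly like this (a simplified sketch for illustration, not the verbatim decorator from the codebase):

```python
import asyncio
from functools import wraps

from fastapi import Request


def cancel_on_disconnect_old(handler):
    # Simplified sketch of the *previous* pattern described above; not verbatim code.
    @wraps(handler)
    async def wrapper(*args, request: Request, **kwargs):
        async def poll_disconnect() -> None:
            while not await request.is_disconnected():
                await asyncio.sleep(1)

        poller_task = asyncio.create_task(poll_disconnect())
        handler_task = asyncio.create_task(handler(*args, request=request, **kwargs))
        try:
            return await handler_task  # step 4: the wrapper returns the handler's result
        finally:
            poller_task.cancel()
            try:
                # step 3: if the poller does not react to the cancellation in time,
                # this raises and the wrapper gives up on it ...
                await asyncio.wait_for(poller_task, timeout=3)
            except (asyncio.TimeoutError, asyncio.CancelledError):
                # step 5: ... leaving poller_task orphaned, still alive and still
                # holding a reference to `request`, which is what blocks the
                # server's own cleanup of the request (step 6).
                pass

    return wrapper
```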

@bisgaard-itis bisgaard-itis requested a review from pcrespov as a code owner June 30, 2025 14:44
@bisgaard-itis bisgaard-itis self-assigned this Jun 30, 2025
@bisgaard-itis bisgaard-itis added the bug (buggy, it does not work as expected) and a:apiserver (api-server service) labels Jun 30, 2025
@bisgaard-itis bisgaard-itis added this to the Engage milestone Jun 30, 2025
@giancarloromeo
Contributor

Now we deterministically exit from the poller loop. 👌

@codecov

codecov bot commented Jun 30, 2025

Codecov Report

❌ Patch coverage is 93.47826% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 85.92%. Comparing base (c5371bd) to head (b8e20d2).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8017      +/-   ##
==========================================
- Coverage   88.04%   85.92%   -2.12%     
==========================================
  Files        1907     1372     -535     
  Lines       73282    56786   -16496     
  Branches     1302      650     -652     
==========================================
- Hits        64519    48795   -15724     
+ Misses       8374     7762     -612     
+ Partials      389      229     -160     
Flag Coverage Δ
integrationtests 64.31% <ø> (-0.02%) ⬇️
unittests 86.00% <93.47%> (-0.68%) ⬇️
Components Coverage Δ
pkg_aws_library ∅ <ø> (∅)
pkg_celery_library ∅ <ø> (∅)
pkg_dask_task_models_library ∅ <ø> (∅)
pkg_models_library ∅ <ø> (∅)
pkg_notifications_library ∅ <ø> (∅)
pkg_postgres_database ∅ <ø> (∅)
pkg_service_integration ∅ <ø> (∅)
pkg_service_library 71.80% <93.47%> (+0.10%) ⬆️
pkg_settings_library ∅ <ø> (∅)
pkg_simcore_sdk 85.05% <ø> (-0.12%) ⬇️
agent 93.81% <ø> (ø)
api_server ∅ <ø> (∅)
autoscaling 95.89% <ø> (ø)
catalog 92.34% <ø> (ø)
clusters_keeper 99.13% <ø> (ø)
dask_sidecar 91.81% <ø> (-0.57%) ⬇️
datcore_adapter 97.94% <ø> (ø)
director 76.14% <ø> (ø)
director_v2 77.95% <ø> (-12.96%) ⬇️
dynamic_scheduler 96.27% <ø> (ø)
dynamic_sidecar 90.12% <ø> (ø)
efs_guardian 89.60% <ø> (ø)
invitations 91.44% <ø> (ø)
payments 92.60% <ø> (ø)
resource_usage_tracker 92.13% <ø> (-0.48%) ⬇️
storage 86.42% <ø> (ø)
webclient ∅ <ø> (∅)
webserver 88.19% <ø> (+0.05%) ⬆️


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update c5371bd...b8e20d2.


Contributor

@giancarloromeo giancarloromeo left a comment


What a hot afternoon ☀️

Member

@pcrespov pcrespov left a comment


Please add a test in the service-library that reproduces the scenario that you are trying to cover.

@bisgaard-itis
Contributor Author

bisgaard-itis commented Jul 1, 2025

After a very useful conversation with my new best friend Gemini, here's the best understanding I have of how to handle asyncio tasks. I will "publish" it here because I think it is very insightful:

The Critical Difference: Where the Program Hangs

So what is the "guarantee" that TaskGroup provides?

The guarantee is not that the task will be killed instantly. The guarantee is structural: the TaskGroup guarantees that your program will not proceed past the async with block and leave an orphaned task behind.

Let's compare the two scenarios:

Scenario 1: Manual Cancellation (The "No Guarantee" case)

```python
async def my_parent_task():
    task = asyncio.create_task(un_cancellable_task())
    task.cancel()
    # My parent task MOVES ON, thinking it cancelled the child.
    # The child is now an ORPHAN, running forever in the background,
    # potentially holding resources and causing silent bugs.
    print("Parent task finished, but the problem is hidden.")
```

Here, the guarantee fails because your parent task has a completely incorrect view of the system's state.

Scenario 2: Using TaskGroup

```python
async def my_parent_task():
    try:
        async with asyncio.TaskGroup() as tg:
            tg.create_task(un_cancellable_task())
            # Let's say another task in the group fails, triggering the
            # cleanup and cancellation of un_cancellable_task().
    except Exception:
        pass

    # THIS LINE IS NEVER REACHED.
    # The `async with` block itself hangs forever, waiting for the
    # un_cancellable_task to finish.
    print("Parent task finished.")
```

Here, the TaskGroup surfaces the problem. It doesn't hide it. The program hangs at the exact point where the misbehaving task is preventing cleanup. You now have a clear place to debug. The guarantee is that you cannot have "zombie" or "orphan" tasks that you don't know about. The TaskGroup forces you to confront them.

The Real Solution: Combining TaskGroup with Timeouts

So, how do you get the best of both worlds? A guarantee of no orphans, and a guarantee that your parent task won't hang forever?

You wrap the entire TaskGroup in a timeout. This is the ultimate pattern for robust concurrent code.

```python
import asyncio

async def un_cancellable_task():
    print("Un-cancellable task started and will ignore cancellation.")
    # This task has no await, so it can't be cancelled.
    for i in range(10_000_000_000):
        pass
    print("This will never be printed.")

async def main():
    print("Starting main task.")
    try:
        # Python 3.11+ has the cleaner `asyncio.timeout()`
        async with asyncio.timeout(2):
            async with asyncio.TaskGroup() as tg:
                tg.create_task(un_cancellable_task())
                # The TaskGroup will try to cancel the task when the
                # outer timeout is triggered.

    except asyncio.TimeoutError:
        print("The entire operation timed out after 2 seconds.")
        print("The un-cancellable task is now orphaned, but we KNOW it.")
        print("Our main program flow is safe and can continue.")

    print("Main task has finished.")
```

This is the pinnacle of asyncio safety:

  1. The TaskGroup ensures all tasks within the operation are managed as a single unit.
  2. The asyncio.timeout() ensures that your application's control flow will always proceed, even if one of the tasks inside the group misbehaves and refuses to cancel.

@giancarloromeo
Contributor

Thanks for the further investigation; it's similar to the solution adopted for Celery task monitoring. 👌

Member

@sanderegg sanderegg left a comment


Please consider using RequestCancellationMiddleware, which is tested. Also check whether that one has the problem you discovered.
Ideally I would also prefer that we re-use that code rather than have two different implementations.
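
(For reference, the general shape of a disconnection-cancelling ASGI middleware is roughly the following. This is a hypothetical sketch to illustrate the pattern, not the actual RequestCancellationMiddleware from the service library.)

```python
import asyncio
import contextlib

from starlette.types import ASGIApp, Receive, Scope, Send


class CancelOnDisconnectMiddleware:
    # Hypothetical sketch of the general pattern only; NOT the service
    # library's RequestCancellationMiddleware.
    def __init__(self, app: ASGIApp) -> None:
        self.app = app

    async def __call__(self, scope: Scope, receive: Receive, send: Send) -> None:
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        queue: asyncio.Queue = asyncio.Queue()

        async def relay_receive() -> None:
            # Forward ASGI messages to the app and return once the client disconnects
            while True:
                message = await receive()
                await queue.put(message)
                if message["type"] == "http.disconnect":
                    return

        app_task = asyncio.create_task(self.app(scope, queue.get, send))
        relay_task = asyncio.create_task(relay_receive())
        try:
            done, _ = await asyncio.wait(
                {app_task, relay_task}, return_when=asyncio.FIRST_COMPLETED
            )
            if relay_task in done and not app_task.done():
                # Client disconnected before the handler finished: cancel it
                app_task.cancel()
        finally:
            # Never leave either task behind as an orphan
            for task in (relay_task, app_task):
                task.cancel()
                with contextlib.suppress(asyncio.CancelledError):
                    await task
```

Whether this lives in a middleware or a decorator, the underlying cancellation semantics (and the orphaned-task pitfall discussed above) are the same, which is the argument for sharing a single implementation.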

@bisgaard-itis bisgaard-itis requested a review from pcrespov July 1, 2025 11:02
@bisgaard-itis bisgaard-itis requested a review from sanderegg July 2, 2025 07:03
@bisgaard-itis
Contributor Author

> I would still like to have one implementation for this and not two different ones. Also, you are aware that @cancel_on_disconnect will also cancel any Starlette BackgroundTask in the request, right?

OK, I will refactor so that both variants share a single implementation, but I will keep both entry points: a decorator and a middleware.

Yes, I am aware that @cancel_on_disconnect will cancel a Starlette BackgroundTask. We have one endpoint in the api-server where we use those (the log streaming) and we don't use this decorator there. So that should be fine.

Member

@sanderegg sanderegg left a comment


I am a bit confused now, as we went through these yesterday with @pcrespov and this again looks different. Can you maybe sync? Then I am happy to approve. Thanks, and sorry for the mess.

@pcrespov
Member

pcrespov commented Jul 5, 2025

@bisgaard-itis I had some interesting experiences with these cancellations in my PR #8014, where I also added uvloop in testing (up to now we were using uvloop in production and the asyncio standard loop for testing). We can talk about it on Monday.

Member

@pcrespov pcrespov left a comment


Please reassign back to me once the requested changes have been checked. Thx :-)

@giancarloromeo giancarloromeo removed their assignment Jul 7, 2025
@sonarqubecloud

sonarqubecloud bot commented Aug 7, 2025


Labels

🤖-automerge (marks PR as ready to be merged for Mergify), a:apiserver (api-server service), bug (buggy, it does not work as expected)


Development

Successfully merging this pull request may close these issues.

Ensure POST /files/content doesn't hang in api-server

5 participants